Abstract

We  consider  a  cognitive  radio  network,  where  M  distributed  secondary  users  search  for  spectrum 
opportunities  among  N  independent  channels  without  information  exchange.  The  occupancy  of  each 
channel by the primary network is modeled as a Bernoulli process with unknown mean which represents
the  unknown  traffic  load  of  the  primary  network.  In  each  slot,  a  secondary  transmitter  chooses  one 
channel  to  sense  and  subsequently  transmit  if  the  channel  is  sensed  as  idle.  Sensing  is  considered  to 
be imperfect, i.e., an idle channel can be sensed as busy and vice versa. Users transmitting on the same
channel collide and none of them can transmit successfully. The objective is to maximize the system
throughput  under  the  collision  constraint  imposed  by  the  primary  network  while  ensuring  synchronous 
channel  selection  between  each  secondary  transmitter  and  its  receiver.  The  performance  of  a  channel 
selection  policy  is  measured  by  the  system  regret,  defined  as  the  expected  total  performance  loss  with 
respect  to  the  optimal  performance  under  the  ideal  scenario  where  all  channel  means  are  known  to  all 
users and collisions among users are eliminated through perfect scheduling. We show that the optimal
system  regret  rate  is  at  the  same  logarithmic  order  as  the  centralized  counterpart  with  perfect  sensing.  An 
order-optimal  decentralized  policy  is  constructed  to  achieve  the  logarithmic  order  of  the  system  regret 
rate  while  ensuring  the  fairness  among  all  users. 

I. Introduction

We consider a distributed learning problem arising in the context of cognitive radio networks. There
are  multiple  distributed  secondary  users  searching  for  idle  channels  temporarily  unused  by  the  primary 
network.  We  assume  that  the  state — 1  (idle)  or  0  (busy) — of  each  channel  evolves  as  an  i.i.d.  Bernoulli 
process  across  time  slots  with  an  unknown  mean  which  represents  the  unknown  traffic  load  of  the  primary 
network.  At  the  beginning  of  each  slot,  each  secondary  transmitter  chooses  one  channel  to  sense  and 
subsequently  transmits  to  its  receiver  if  the  channel  is  sensed  as  idle.  Sensing  is  subject  to  errors:  an  idle 
channel  may  be  sensed  as  busy  and  vice  versa.  If  the  transmission  is  successful,  the  secondary  receiver 
sends  back  an  acknowledgement  (ACK)  to  the  transmitter  over  the  same  channel  at  the  end  of  the  slot.  The 
secondary users do not exchange information on their decisions and observations. There are two types
of  collisions  that  may  occur:  a  primary  collision  happens  when  a  secondary  user  transmits  in  a  busy 
channel  and  a  secondary  collision  happens  when  multiple  secondary  users  transmit  in  the  same  channel. 
In  either  case,  the  transmission  fails.  The  objective  is  to  design  a  decentralized  channel  selection  policy 
for  optimal  long-term  network  throughput  under  a  constraint  on  the  maximum  probability  of  primary 
collisions. 

Another  important  design  constraint  is  the  synchronous  channel  selection  between  each  secondary 
transmitter  and  its  receiver.  We  do  not  assume  any  dedicated  control  channel  to  coordinate  each  pair  of 
the secondary transmitter and receiver. To ensure synchronization, they can either make the decision based
on the common observation history (i.e., number of ACKs observed from each channel) or exploit
the idle channels to exchange control information to coordinate. The tradeoff involved here is that the
information from ACKs may not be sufficient for learning the channel rank due to collisions while
additional communications between a secondary transmitter and its receiver cause a sacrifice in the
throughput.

We  measure  the  performance  of  a  decentralized  policy  by  the  system  regret,  which  is  defined  as  the 
expected  total  data  loss  with  respect  to  the  optimal  performance  under  the  ideal  scenario  where  all  channel 
means are known to all users and collisions among users are eliminated through perfect scheduling.
The  objective  is  to  minimize  the  rate  at  which  the  regret  grows  with  time.  Note  that  the  system  regret 
rate  is  a  finer  performance  measure  than  the  long-term  throughput.  All  policies  with  a  sublinear  regret 
rate  would  achieve  the  maximum  long-term  throughput.  However,  the  difference  in  their  performance 
measured  by  the  expected  total  bits  of  transmitted  data  over  a  time  horizon  of  length  T  can  be  arbitrarily 
large as T grows. It is thus of great interest to characterize the minimum regret rate and construct
policies optimal under this finer performance measure.

The  above  problem  involves  a  complicated  dilemma  of  exploitation,  exploration,  and  competition. 
Specifically,  each  user  needs  to  learn  the  channel  rank  efficiently  in  order  to  choose  the  best  channels  while 
avoiding significant collisions with other users. Compared to the scenario of perfect sensing, learning the
channel  rank  under  imperfect  sensing  is  substantially  more  challenging  due  to  the  imperfect  observation 
of  channel  states  and  the  synchronization  constraint  between  each  secondary  receiver  and  its  transmitter. 

In  this  paper,  we  show  that  the  minimum  system  regret  rate  is  at  the  same  logarithmic  order  as 
the  centralized  counterpart  with  perfect  sensing.  A  decentralized  policy  is  constructed  to  achieve  this 
optimal  order.  Under  this  policy,  the  system  throughput  quickly  converges  to  the  maximum  throughput 
in  the  ideal  scenario  of  known  channel  model  and  centralized  scheduling.  The  proposed  policy  further 
achieves  the  fairness  among  users,  i.e.,  all  users  converge  to  the  same  local  throughput  at  the  same  rate  as 
time  goes  to  infinity.  Last,  we  extend  the  problem  to  general  decentralized  multi-armed  bandits  (MAB) 
with  imperfect  observation  models  where  control  information  exchange  between  the  transmitter  and  the 
receiver is prohibited. A decentralized policy is proposed to achieve the $O(\sqrt{T})$ regret rate with time T.

Related Work. Under perfect sensing, the cognitive radio network with unknown Bernoulli channel
model and multiple distributed users was considered in [1-3]. In [1], a heuristic policy based on histogram
estimation of the unknown parameters was proposed. This policy provides a linear order of the
system regret rate and thus cannot achieve the maximum throughput. In [2], the problem is formulated as
a  decentralized  MAB,  which  generalizes  the  classic  MAB  with  a  single  user  [4,5].  A  time  division  fair 
sharing  (TDFS)  framework  for  constructing  order-optimal  and  fair  decentralized  policies  is  proposed 
under  general  reward,  observation,  and  collision  models.  In  [3],  order-optimal  distributed  policies  were 
established based on the single-user policies proposed in [6]. Compared to the TDFS policies developed
in  [2],  the  policies  proposed  in  [3]  are  limited  to  Bernoulli  reward  models  and  cannot  achieve  fairness 
among  users.  In  [7],  a  more  general  channel  model  that  allows  each  channel  to  have  different  means  for 
different  users  is  considered  under  perfect  sensing.  A  centralized  policy  that  assumes  full  information 
exchange  and  cooperation  among  users  is  proposed  which  achieves  the  logarithmic  order  of  the  regret 
rate. 

Notation. Let $|\mathcal{A}|$ denote the cardinality of set $\mathcal{A}$. For two positive integers k and l, define
$k \oslash l = ((k - 1) \bmod l) + 1$, which is an integer taking values from $1, 2, \ldots, l$.
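
As a small illustration of the wrap-around index just defined, the following Python sketch evaluates ((k - 1) mod l) + 1 for a few values. The function name `wrap_index` is ours and does not appear in the paper.

```python
def wrap_index(k: int, l: int) -> int:
    """Return ((k - 1) mod l) + 1, an integer in {1, ..., l}."""
    if k < 1 or l < 1:
        raise ValueError("k and l must be positive integers")
    return ((k - 1) % l) + 1


if __name__ == "__main__":
    # Successive values of k cycle through 1, 2, ..., l.
    print([wrap_index(k, 3) for k in range(1, 8)])
```

This indexing is convenient for round-robin schedules because it maps a running counter directly to a 1-based channel index.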

II. Network Model

Consider  the  spectrum  consisting  of  N  independent  but  nonidentical  channels  and  M  distributed 
secondary users. Each user consists of one transmitter and one receiver. Let $\mathbf{S}(t) = [S_1(t), \ldots, S_N(t)] \in \{0,1\}^N$
($t \ge 1$) denote the system state, where $S_i(t) \in \{0 \text{ (busy)}, 1 \text{ (idle)}\}$ is the state of channel i in
slot t that evolves as an i.i.d. Bernoulli process with unknown mean $\theta_i \in (0, 1)$. We assume that the
M largest means are distinct.

In slot t, a secondary user (say, user i, $1 \le i \le M$) chooses a sensing action $a_i(t) \in \{1, \ldots, N\}$
that specifies the channel (say, channel n) to sense based on its observation and decision history. Based
on  the  sensed  signals,  the  user  detects  the  channel  state,  which  can  be  considered  as  a  binary  hypothesis 
test: 

$$H_0 : S_n(t) = 1 \text{ (idle)} \quad \text{vs.} \quad H_1 : S_n(t) = 0 \text{ (busy)}.$$

The  performance  of  channel  state  detection  is  characterized  by  the  receiver  operating  characteristics 
(ROC) which relates the probability of false alarm $\epsilon$ to the probability of miss detection $\delta$:

$$\epsilon = \Pr\{\text{decide } H_1 \mid H_0 \text{ is true}\}, \quad \delta = \Pr\{\text{decide } H_0 \mid H_1 \text{ is true}\}.$$

If  the  detection  outcome  is  Ho,  the  user  accesses  the  channel  for  data  transmission.  The  design  should 
be  subject  to  a  constraint  on  the  probability  of  accessing  a  busy  channel,  which  causes  interference  to 
the primary network. Specifically, the probability of collision $P_n(t)$ perceived by the primary network in
any channel and slot is capped below a predetermined threshold $\zeta$, i.e.,

$$P_n(t) = \Pr(\text{decide } H_0 \mid S_n(t) = 0) = \delta \le \zeta, \quad \forall\, n, t.$$

We should set the miss detection probability $\delta = \zeta$ as the detector operating point to minimize the false
alarm probability $\epsilon$. If multiple users decide to transmit over the same channel, they collide and no one
can  transmit  successfully.  In  other  words,  a  secondary  user  can  transmit  data  successfully  if  and  only  if 
the  chosen  channel  is  idle,  detected  correctly,  and  no  collision  happens.  Since  failed  transmissions  may 
occur,  acknowledgements  (ACKs)  are  necessary  to  ensure  guaranteed  delivery.  Specifically,  when  the 
receiver  successfully  receives  a  packet  from  a  channel,  it  sends  an  acknowledgement  to  the  transmitter 
over  the  same  channel  at  the  end  of  the  slot.  Otherwise,  the  receiver  does  nothing,  i.e.,  a  NAK  is 
defined as the absence of an ACK. We assume that acknowledgements are received without error since
acknowledgements  are  always  transmitted  over  idle  channels. 
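
To make the sensing and access model concrete, the following Python sketch lists the three per-slot outcomes for a single user that is alone on a channel, and checks the closed-form probabilities against a short simulation. It is only an illustration under stated assumptions: the function names, the numerical values of the idle probability, false alarm probability, and miss detection probability, and the use of a seeded `random.Random` generator are ours, and secondary collisions are ignored.

```python
import random


def outcome_probabilities(theta: float, eps: float, delta: float) -> dict:
    """Outcome probabilities in one slot for a channel that is idle with probability theta.

    eps   = Pr{detect busy | channel idle}  (false alarm)
    delta = Pr{detect idle | channel busy}  (miss detection)
    These are unconditional on the channel state; conditioned on a busy
    channel, the primary collision probability is delta.
    """
    return {
        "success": (1 - eps) * theta,                           # idle, detected idle
        "primary_collision": delta * (1 - theta),               # busy, detected idle
        "no_access": eps * theta + (1 - delta) * (1 - theta),   # detected busy
    }


def simulate_slot(theta: float, eps: float, delta: float, rng: random.Random) -> str:
    """Draw the channel state, then the detection outcome, for one slot."""
    idle = rng.random() < theta
    if idle:
        return "no_access" if rng.random() < eps else "success"
    return "primary_collision" if rng.random() < delta else "no_access"


def empirical_frequencies(theta, eps, delta, slots=200_000, seed=0) -> dict:
    """Relative frequency of each outcome over a number of simulated slots."""
    rng = random.Random(seed)
    counts = {"success": 0, "primary_collision": 0, "no_access": 0}
    for _ in range(slots):
        counts[simulate_slot(theta, eps, delta, rng)] += 1
    return {name: c / slots for name, c in counts.items()}


if __name__ == "__main__":
    print(outcome_probabilities(0.6, 0.1, 0.05))
    print(empirical_frequencies(0.6, 0.1, 0.05))
```

An ACK is observed only in the "success" outcome, which is why a NAK alone cannot distinguish a false alarm from a primary collision.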

III. Problem Formulation

In  each  slot,  each  secondary  transmitter  and  its  receiver  need  to  select  the  same  channel  for  data 
transmission  without  a  dedicated  control  channel.  One  natural  way  is  that  the  transmitter  and  its  receiver 
use  the  common  local  observation  history  (ACKs/NAKs)  in  learning  and  decision  making.  However,  due 
to  the  collisions  among  secondary  users,  the  information  included  in  previously  observed  ACKs/NAKs 
may  not  be  sufficient  to  learn  the  unknown  channel  model  efficiently.  An  alternative  approach  is  to 
let  each  transmitter  decide  whether  or  not  to  send  its  receiver  the  control  information  (instead  of  the 
objective  data)  in  the  chosen  channel  for  future  synchronization.  We  consider  the  worst  scenario  that 
each transmission of the control information occupies an entire idle slot¹. Since sending the control
information  causes  a  sacrifice  in  the  immediate  throughput,  it  should  be  avoided  as  much  as  possible  in 
order  to  maximize  the  number  of  opportunities  for  transmitting  the  objective  data. 

We define a local policy $\pi_i$ for user i as a sequence of functions $\pi_i = \{\pi_i(t)\}_{t \ge 1}$, where $\pi_i(t)$ maps
user i's local information that is common to its transmitter and receiver to the sensing action $a_i(t)$ in
slot t. The decentralized policy $\pi$ is thus given by the concatenation of the local policy for each user:
$\pi = [\pi_1, \ldots, \pi_M]$. Define the immediate reward $Y(t)$ as the total number of successful transmissions of the
objective data by all users in slot t:

$$Y(t) = \sum_{j=1}^{N} I_j(t) S_j(t),$$

where $I_j(t)$ is the indicator function that equals 1 if channel j is accessed by only one user and used
for transmitting the objective data, and 0 otherwise.

Let $\Theta = (\theta_1, \theta_2, \ldots, \theta_N)$ be the unknown parameter set and $\sigma$ a permutation such that
$\theta_{\sigma(1)} \ge \theta_{\sigma(2)} \ge \cdots \ge \theta_{\sigma(N)}$. The performance measure of a decentralized policy $\pi$ is defined as the system regret

$$R_T^{\pi}(\Theta) = T \sum_{j=1}^{M} (1 - \epsilon)\theta_{\sigma(j)} - \mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} Y(t)\Big].$$

It is easy to see that $T \sum_{j=1}^{M} (1 - \epsilon)\theta_{\sigma(j)}$ is the maximum expected total reward over T slots under the
ideal scenario that the parameter set $\Theta = (\theta_1, \ldots, \theta_N)$ is known and users are centralized.
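
The benchmark in the regret definition can be evaluated directly once the channel means are given. The Python sketch below is a minimal illustration: the function names and the numerical values are hypothetical, and the expected total reward of a policy is taken as an input rather than computed.

```python
def ideal_expected_reward(theta, M, eps, T):
    """T times the sum of (1 - eps) * theta over the M largest channel means."""
    best = sorted(theta, reverse=True)[:M]
    return T * sum((1 - eps) * th for th in best)


def system_regret(theta, M, eps, T, expected_total_reward):
    """Gap between the ideal benchmark and a policy's expected total reward."""
    return ideal_expected_reward(theta, M, eps, T) - expected_total_reward


if __name__ == "__main__":
    theta = [0.2, 0.9, 0.5]      # hypothetical idle probabilities
    print(ideal_expected_reward(theta, M=2, eps=0.1, T=100))
    print(system_regret(theta, M=2, eps=0.1, T=100, expected_total_reward=120.0))
```

The factor (1 - eps) appears because even a perfectly scheduled user loses the idle slots in which a false alarm occurs.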

Note  that  the  regret  is  always  growing  with  time  since  users  can  never  identify  the  channel  parameters 
perfectly. The objective is to minimize the rate at which $R_T^{\pi}(\Theta)$ grows with time T under any parameter
set $\Theta$ by choosing the optimal decentralized policy $\pi^*$.

¹The results established in this paper can be directly extended to a more relaxed piggybacking scenario that assumes that the
control information occupies negligible capacity and is included in the data packet in each slot.

IV. Optimal Order of the System Regret


In  this  section,  we  show  that  the  minimum  system  regret  rate  is  at  the  logarithmic  order  with  time, 
which  implies  that  the  system  can  achieve  the  maximum  throughput  at  a  significantly  fast  rate. 

Theorem 1: The optimal order of the system regret rate is logarithmic with time, i.e., for an optimal
decentralized policy $\pi^*$, we have, $\forall\, \Theta$,

$$L(\Theta) = \liminf_{T \to \infty} \frac{R_T^{\pi^*}(\Theta)}{\log T} \le \limsup_{T \to \infty} \frac{R_T^{\pi^*}(\Theta)}{\log T} = U(\Theta) \qquad (1)$$

for some constants $L(\Theta)$ and $U(\Theta)$ that depend on $\Theta$.

Proof: To prove the lower bound, we consider a genie-aided system where users are centralized
and  the  synchronization  constraint  on  each  secondary  transmitter  and  its  receiver  is  removed  from 
consideration.  Note  that  the  channel  parameters  remain  unknown  to  all  users  in  the  genie-aided  system. 
It  is  easy  to  see  that  the  problem  is  equivalent  to  the  one  with  a  single  user  that  can  sense  M  channels 
simultaneously  in  each  slot.  For  simplicity,  we  focus  on  the  latter  one.  In  each  slot,  the  user  obtains  two 
types  of  observations  from  each  chosen  channel:  the  detection  outcome  and  the  ACK/NAK.  In  Lemma  1, 
we  show  that  the  regret  rate  in  the  genie-aided  system  is  at  least  logarithmic  with  time.  The  proof  is  thus 
completed  by  noticing  that  the  minimum  regret  rate  in  the  problem  at  hand  is  lower  bounded  by  the  one 
in the genie-aided system.

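Lemma 1: Let $R_T^{\pi}(\Theta)$ denote the regret under a policy $\pi$ in the genie-aided system over T slots. If
$R_T^{\pi}(\Theta) = o(T^c)$ $\forall\, c > 0$ and $\forall\, \Theta$, then, $\forall\, \Theta$,

$$\liminf_{T \to \infty} \frac{R_T^{\pi}(\Theta)}{\log T} \ge (1 - \epsilon) \sum_{n:\, \theta_n < \theta_{\sigma(M)}} \frac{\theta_{\sigma(M)} - \theta_n}{G(\theta_n, \theta_{\sigma(M)})}, \qquad (2)$$

where

$$G(\theta_i, \theta_j) = \big(\epsilon\theta_i + (1 - \delta)(1 - \theta_i)\big) \log \frac{\epsilon\theta_i + (1 - \delta)(1 - \theta_i)}{\epsilon\theta_j + (1 - \delta)(1 - \theta_j)} + \delta(1 - \theta_i) \log \frac{\delta(1 - \theta_i)}{\delta(1 - \theta_j)} + (1 - \epsilon)\theta_i \log \frac{(1 - \epsilon)\theta_i}{(1 - \epsilon)\theta_j}$$

is the K-L distance between two joint distributions of the detection outcome and the ACK/NAK
parameterized by $\theta_i$ and $\theta_j$, respectively.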
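
The K-L distance G can be read as the divergence between two three-point distributions over the joint observation: "detected busy", "detected idle but no ACK", and "detected idle with ACK". The Python sketch below computes it that way and compares it with the Bernoulli K-L distance in the perfect-sensing case. The function names are ours, and the three-outcome reading is our interpretation of the joint distribution of the detection outcome and the ACK/NAK for a single user without secondary collisions.

```python
import math


def joint_outcome_dist(theta: float, eps: float, delta: float):
    """Probabilities of (detected busy, detected idle + NAK, detected idle + ACK)."""
    return (
        eps * theta + (1 - delta) * (1 - theta),
        delta * (1 - theta),
        (1 - eps) * theta,
    )


def G(theta_i: float, theta_j: float, eps: float, delta: float) -> float:
    """K-L distance between the joint observation distributions under theta_i and theta_j."""
    p = joint_outcome_dist(theta_i, eps, delta)
    q = joint_outcome_dist(theta_j, eps, delta)
    # Terms with zero probability under theta_i contribute nothing (0 log 0 = 0).
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def bernoulli_kl(p: float, q: float) -> float:
    """K-L distance between Bernoulli(p) and Bernoulli(q), for p and q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


if __name__ == "__main__":
    print(G(0.3, 0.7, eps=0.1, delta=0.05))
    # With eps = delta = 0 the observation reveals the channel state exactly.
    print(G(0.3, 0.7, eps=0.0, delta=0.0), bernoulli_kl(0.3, 0.7))
```

With no sensing errors the middle outcome has zero probability and G coincides with the Bernoulli K-L distance, which matches the perfect-sensing setting.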

Proof:  The  proof  follows  a  similar  line  to  that  of  Theorem  3.1  in  [5]  by  combining  the  detection 
outcome  and  ACK/NAK  as  a  single  observation  vector  of  an  arm.  ■ 

For the upper bound, we show that there exists a decentralized policy that achieves the logarithmic
order  of  the  growth  rate  of  the  system  regret.  See  Sec.  V  for  details.  ■ 

V. The Order-Optimal Decentralized Policy

In this section, we establish an order-optimal and fair decentralized policy $\pi_F^*$ to achieve the optimal
logarithmic  order  of  the  system  regret  rate.  The  general  structure  of  the  policy  is  based  on  the  time 
division  fair  sharing  (TDFS)  of  the  M  best  channels  among  the  M  distributed  users.  The  TDFS  structure 
is  first  proposed  in  [2]  in  the  scenario  of  perfect  sensing.  Due  to  the  imperfect  observation  of  channel 
state  and  the  synchronization  constraint,  extending  the  TDFS  framework  to  the  problem  at  hand  is  highly 
nontrivial. 

Specifically,  the  local  policy  of  each  user  consists  of  disjoint  rounds  of  playing  the  M  channels 
considered  to  be  the  best.  Different  users  have  different  offsets  in  sensing  the  sets  of  M  channels. 
Consider, for example, that user 1 has offset 0. In each round, the user successively senses the channels
it considers to be the best, second best, ..., and Mth best. The offset in each user’s round-robin schedule
can  be  predetermined  (e.g.,  based  on  the  user’s  ID). 
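
The round-robin structure with offsets can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are ours: all users hold the same ranked list of M channels, user u (counted from 0) has offset u, and the function name `tdfs_schedule` is hypothetical.

```python
def tdfs_schedule(rank, num_users, num_rounds=1):
    """Yield, slot by slot, the channel sensed by each user.

    rank: the M channels considered to be the best, in decreasing order.
    User u (0-based) starts each round at position u of the ranked list.
    """
    M = len(rank)
    for slot in range(num_rounds * M):
        yield [rank[(slot + u) % M] for u in range(num_users)]


if __name__ == "__main__":
    # Three users sharing the three channels ranked (5, 2, 7).
    for slot, channels in enumerate(tdfs_schedule([5, 2, 7], num_users=3), start=1):
        print(slot, channels)
```

When every user has learned the same correct rank, the distinct offsets place the users on different channels in every slot, and each user visits each of the M channels once per round.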

To  achieve  the  optimal  order  of  the  system  regret  rate,  it  is  crucial  that  each  user  efficiently  learns  and 
senses the M best channels in the correct order while ensuring the synchronization between each transmitter
and its receiver without significant communication overhead. We first propose a synchronization
mechanism  for  each  transmitter  and  its  receiver.  Based  on  the  symmetry  among  users,  it  is  sufficient  to 
consider  one  user,  say,  user  1.  We  assume  that  its  transmitter  and  receiver  have  a  simple  initial  setup 
for  synchronization,  e.g.,  in  the  first  round,  they  will  both  tune  to  channel  1,  2,  •  •  • ,  M  (i.e.,  the  initial 
channel  rank  of  the  M  channels  considered  to  be  the  best  is  (1,2,  •  •  •  ,  M)).  As  shown  in  Fig.  1,  if  an 
ACK  is  observed,  the  transmitter  will  update  the  channel  rank  according  to  its  sensing  and  detection 
history.  If  the  updated  channel  rank  is  different  from  the  current  one,  the  transmitter  will  keep  sending  its 
receiver the updated channel rank (instead of the objective data) until the channel rank is successfully received
(i.e., a new ACK is observed). For simplicity of presentation, we assume that the channel capacity is
enough to send the channel rank in one slot when it is idle². Based upon a successful reception of the
updated channel rank, the transmitter and receiver will use this new channel rank for channel sensing
in the next round. If there is no new channel rank received, they will keep using the previous one. We
point out that in each round the transmitter only updates the channel rank once, based on the first ACK (if
any) received in this round.
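
The synchronization mechanism described above can be sketched as a small per-round routine. The Python code below is a simplified illustration, not the paper's implementation: `run_round` and the callback `learn_rank` are hypothetical names, a rank that has not yet been delivered is assumed to carry over to the next round, and while a rank is awaiting delivery an ACK is interpreted as its successful reception rather than as a trigger for a new update.

```python
from typing import Callable, Optional, Sequence, Tuple

Rank = Tuple[int, ...]


def run_round(current: Rank, pending: Optional[Rank], acks: Sequence[bool],
              learn_rank: Callable[[], Rank]) -> Tuple[Rank, Optional[Rank], int]:
    """Simulate one round for a single transmitter-receiver pair.

    current: rank used by both sides for sensing in this round
    pending: rank computed earlier but not yet delivered (None if nothing is pending)
    acks:    whether an ACK is observed in each slot of the round
    Returns (rank for the next round, rank still pending, control attempts).
    """
    next_rank = current
    updated = False          # at most one rank update per round
    control_attempts = 0     # slots devoted to the rank instead of data
    for ack in acks:
        if pending is not None:
            control_attempts += 1
            if ack:
                # Rank delivered: both sides adopt it from the next round on.
                next_rank, pending = pending, None
        elif ack and not updated:
            updated = True
            candidate = learn_rank()
            if candidate != next_rank:
                pending = candidate   # start sending it in the following slots
    return next_rank, pending, control_attempts


if __name__ == "__main__":
    state = ((1, 2), None)
    for acks in ([False, True], [False, True], [True, True]):
        rank, pending, cost = run_round(state[0], state[1], acks, lambda: (2, 3))
        print(rank, pending, cost)
        state = (rank, pending)
```

Because both sides change the sensing order only after an ACK confirms delivery, the transmitter and receiver never disagree on the rank used in a round.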

²Note that the channel rank consists of integer values and only needs finite capacity to transmit. If the channel capacity is
not enough to send the channel rank in one slot, the transmitter will send the channel rank in multiple slots.

Next, we consider the learning of channel rank at the transmitter whenever an update is required. The
basic approach is to reduce the problem to the one with the perfect observation model as considered
in [2]. Note that the transmitter only uses the detection outcome (not ACKs/NAKs) to learn the channel
order at each update. Since the mean of the detection outcome from a channel (say, channel n) is equal
to $(1 - \epsilon)\theta_n$, the channel rank ordered by their state means is the same as that ordered by their detection
means.  We  can  thus  treat  the  detection  outcome  as  the  new  state  of  each  channel  in  learning  the  channel 
rank.  Consequently,  the  observation  of  the  new  state  becomes  perfect.  The  user  then  adopts  a  procedure 
analogous  to  that  in  [2]  to  identify  the  set  of  the  M  best  channels.  Basically,  the  user  first  identifies  the 
best channel by applying a single-user policy (say, the Lai-Robbins policy established in [4]) for the classic
MAB. To identify the kth ($1 < k \le M$) best channel, the user removes the k - 1 channels considered
to have a higher rank than other channels and applies the Lai-Robbins policy to the remaining N - k + 1
channels. The main difference from [2] is that the user needs to identify the entire rank of the
M best channels in one shot (as the first ACK is observed in the current round) and the channel sensing
under this rank cannot be realized until the round before which the receiver has successfully received
this rank information and no newer update has been received. Establishing the efficiency for learning the
channel rank is thus more challenging compared to the scenario addressed in [2].

A detailed implementation of the decentralized policy $\pi_F^*$ is given in Fig. 2.
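
The choice between a leader and a round-robin candidate, as used in step 2 of Fig. 2 to identify the best channel, can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names, the 0-based channel indices, and the fallback to the round-robin candidate when no channel is eligible are our assumptions, and every channel is assumed to have been sensed at least once.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """K-L distance I(p, q) between Bernoulli(p) and Bernoulli(q)."""
    def term(a: float, b: float) -> float:
        if a == 0:
            return 0.0
        if b == 0:
            return math.inf
        return a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)


def pick_best(means, counts, t, u, b):
    """Return the 0-based index of the channel chosen as the best.

    means[n], counts[n]: detection mean and number of sensings of channel n
    t: current slot (t >= 2); u: number of rank updates so far (u >= 1)
    b: constant with 0 < b < 1/N
    """
    N = len(means)
    r = (u - 1) % N  # round-robin candidate, 0-based form of ((u - 1) mod N) + 1
    eligible = [n for n in range(N) if counts[n] >= (u - 1) * b]
    if not eligible:
        return r
    leader = max(eligible, key=lambda n: means[n])
    threshold = math.log(t - 1) / counts[r]
    if means[leader] > means[r] and bernoulli_kl(means[r], means[leader]) > threshold:
        return leader
    return r


if __name__ == "__main__":
    print(pick_best([0.9, 0.2, 0.5], [50, 50, 50], t=151, u=2, b=0.1))
    print(pick_best([0.9, 0.2, 0.5], [50, 1, 50], t=151, u=2, b=0.1))
```

In the first call the round-robin candidate has been sensed often enough to be ruled out, so the leader is chosen; in the second it has been sensed only once, so the test fails and the candidate is selected, which forces further exploration of that channel.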

Theorem 2: Under the decentralized policy $\pi_F^*$, we have

$$\limsup_{T \to \infty} \frac{R_T^{\pi_F^*}(\Theta)}{\log T} = C(\Theta) \qquad (3)$$

for some constant $C(\Theta)$ that depends on $\Theta$.

Proof: Note that the set of slots in which a reward loss occurs is a subset of slots in which there exists
a user that does not sense the correct channel or a transmitter that sends the channel rank information
instead of the objective data. It is thus sufficient to prove that the expected number of slots in which a user does
not sense the M best channels in a correct order or its transmitter sends the channel rank information
to  the  receiver  is  at  most  logarithmic  with  time.  Without  loss  of  generality,  consider  user  1.  We  first 
present  the  following  lemma,  which  shows  that  the  expected  number  of  times  that  the  transmitter  does 
not  update  the  channel  rank  correctly  is  at  most  logarithmic  with  time. 

Lemma 2: Let $f_u(T)$ denote the number of times that the channel rank is updated incorrectly at the
transmitter. We have

$$\limsup_{T \to \infty} \frac{\mathbb{E}[f_u(T)]}{\log T} \le V(\Theta)$$

for some constant $V(\Theta)$ that depends on $\Theta$.

Proof:  See  Appendix  A  for  details.  ■ 

Now  we  show  that  the  expected  number  of  rounds  that  a  user  does  not  sense  the  M  best  channels 
in a correct order is at most logarithmic with time. Note that the expected number of slots between two
successive updates at the transmitter is uniformly bounded by some constant. So the expected number
of successive rounds that the user does not sense the M best channels in the correct order caused by the
previous  incorrect  update  is  uniformly  bounded  by  some  constant.  The  expected  number  of  rounds  that 
the  user  does  not  sense  the  M  best  channels  in  a  correct  order  is  thus  at  the  same  order  as  that  of  the 
incorrect  updates  on  the  channel  rank  at  the  transmitter,  which  is  at  most  logarithmic  with  time  based 
on  Lemma  2. 

Note that the transmitter only needs to send its receiver an update if the new channel rank is
different from the current one. Except when the channel rank is updated incorrectly, the updated channel
ranks  are  all  the  same.  By  noticing  that  the  expected  number  of  times  that  the  channel  rank  is  updated 
incorrectly  is  at  most  logarithmic  with  time,  the  expected  number  of  times  that  the  transmitter  needs 
to  send  its  receiver  the  updated  channel  rank  information  is  at  most  logarithmic  with  time.  Since  each 
sending  duration  till  a  successful  reception  is  uniformly  bounded  in  expectation,  the  expected  number  of 
slots  that  the  transmitter  sends  its  receiver  the  updated  channel  rank  information  is  at  most  logarithmic 
with  time. 

We have thus proved Theorem 2.


Based on the symmetry among users’ local policies, we show that $\pi_F^*$ achieves the fairness among all
users. 

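Theorem 3: Define the local regret for user i under $\pi_F^*$ as

$$R_{T,i}^{\pi_F^*}(\Theta) = \frac{1}{M} T \sum_{j=1}^{M} (1 - \epsilon)\theta_{\sigma(j)} - \mathbb{E}_{\pi_F^*}\Big[\sum_{t=1}^{T} Y_i(t)\Big],$$

where $Y_i(t)$ is the immediate reward obtained by user i in slot t. We have

$$\limsup_{T \to \infty} \frac{R_{T,i}^{\pi_F^*}(\Theta)}{\log T} = \frac{1}{M} \limsup_{T \to \infty} \frac{R_T^{\pi_F^*}(\Theta)}{\log T}, \quad \forall\, i \in \{1, \ldots, M\}.$$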


VI.  Extension  to  General  Decentralized  MAB  with  Imperfect  Observation  Models 

In  this  section,  we  formulate  the  decentralized  MAB  with  imperfect  observation  models  that  generalizes 
the  cognitive  radio  problem  considered  in  previous  sections. 

[Figure 1: timeline of user 1's sensing actions over Rounds 1 to 5, marking the ACK/NAK observed in each slot, the channel rank updates at the transmitter (new channel ranks (2,3) and (1,3), and one update that leaves the rank (1,2) unchanged), the slots spent sending the updated channel rank, and the rounds in which the received updated channel rank is used for sensing.]

Fig. 1. An example of the structure of user 1's local policy under (M = 2, N = 3, tx: transmitter).


In  general,  there  exist  M  distributed  players  (users)  and  N  arms  (channels)  in  the  system.  The  reward 
that  each  arm  can  offer  is  an  i.i.d.  process  with  unknown  mean.  In  each  slot,  each  player  decides  to 
play  one  arm  based  on  its  local  observation  and  decision  history.  If  multiple  players  choose  the  same 
arm  to  play,  the  reward  obtained  and  observed  by  each  of  them  will  be  distorted  in  an  arbitrary  way 
(either  deterministically  or  statistically).  The  cognitive  radio  problem  considered  in  previous  sections  can 
be  considered  as  a  special  case  of  the  general  model,  where  sensing  a  channel  corresponds  to  playing 
an  arm  and  the  reward  on  each  arm  is  given  by  its  state.  We  point  out  that,  in  general,  there  is  no 
‘transmitter’  that  can  ‘sense’  the  arm  state  without  being  affected  by  collisions. 

To  design  an  optimal  decentralized  policy  under  the  general  imperfect  observation  model,  the  local 
observation  history  of  each  user  needs  to  be  filtered  to  extract  trustable  information  for  learning  the  arm 
rank.  This  could  involve  a  complicated  change  detection  problem  and  the  minimum  system  regret  rate 
may not achieve the logarithmic order. In this section, we propose a simple policy $\pi_F^g$ to achieve the
$O(\sqrt{T})$ regret rate with time T while ensuring the fairness among all players. The following assumptions
will  be  adopted. 

A1. The means of the M best arms are nonnegative and distinct.

A2.  The  variance  of  the  reward  from  each  arm  is  finite. 

The basic idea in $\pi_F^g$ is to construct a deterministic sequence in which the collisions among players
are  perfectly  avoided.  In  this  sequence,  each  user  plays  each  of  the  N  arms  in  a  round  robin  fashion 
with  a  different  offset.  Each  user  computes  the  sample  mean  of  each  arm  solely  based  on  the  reward 
obtained  in  the  slots  that  belong  to  this  sequence.  In  other  slots  that  do  not  belong  to  this  sequence, 
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The  Decentralized  Policy  it*F 

Without  loss  of  generality,  consider  user  i. 

•  Notations  and  Inputs:  let  6n(t)  denote  the  detection  mean  obtained  from  channel  n  at  the 
transmiiter  and  Tnj  the  number  of  times  that  channel  n  is  sensed  up  to  (but  excluding)  slot  t.  Let 
1(6,  9')  =  0\og(0 /O')  +  (1  —  0)  log((l  —  9)/(  1  —  6'))  denote  the  K-L  distance  between  the  Bernoulli 
distributions  parameterized  by  6  and  O',  respectively.  User  i  first  senses  each  channel  once  in  the 
first  N  slots  to  establish  initial  observations.  Starting  from  slot  t  +  1,  user  is  local  policy  consists 
of  disjoint  rounds  of  sensing  the  M  channels  considered  to  be  the  best.  Let  Q/,  denote  the  channel 
sensing  order  in  the  /.:th  round.  Let  14 k  denote  the  number  of  updates  of  channel  rank  at  the  transmitter 
up  to  (and  including)  round  k.  Initially,  Qi  =  (1, 2,  ■  ■  ■  ,  M)  and  Uq  =  0.  Select  a  b  (0  <  b  <  1  /N). 

•  In  the  kth  round,  user  i  does  the  following. 

1.  Both  the  transmitter  and  receiver  sense  the  channels  considered  to  be  the  M  best  in  turn  according 
to  Qk-  If  an  ACK  is  observed  and  this  is  the  first  ACK  observed  in  this  round,  the  transmitter 
set  U k  =  Z4-i  +  1  and  updates  the  rank  of  the  M  channels  considered  to  be  the  best  according 
to  step  2.  The  transmitter  sends  the  receiver  the  updated  channel  rank  if  it  is  different  from  Qk 
until  the  next  ACK  observed.  If  the  receiver  received  a  packet  consisting  of  the  updated  channel 
rank  previously  sent  by  the  transmitter,  the  receiver  sends  back  an  ACK  and  both  the  transmitter 
and  receiver  set  Qk+i  equal  to  the  updated  channel  rank;  otherwise  Qk+i  =  Qk- 

2. First, the transmitter identifies the best channel. Let $t$ denote the current time. The user chooses
between a leader $l_t$ and a round-robin candidate $r_t = U_k \oslash N$, where the leader $l_t$ is the channel
with the largest detection mean among all channels that have been sensed at least $(U_k - 1)b$
times. The user chooses the leader $l_t$ as the best if $\bar{\theta}_{l_t}(t) > \bar{\theta}_{r_t}(t)$ and
$I(\bar{\theta}_{r_t}(t), \bar{\theta}_{l_t}(t)) > \log(t-1)/\tau_{r_t,t}$; otherwise the user chooses the round-robin
candidate $r_t$ as the best. To identify the $k$th ($k > 1$) best channel, the user removes from the channel
set the $k-1$ channels considered to have a higher rank and then chooses between a leader and a
round-robin candidate defined within the remaining channels. Specifically, let $m(t)$ denote the number
of times that the same set of $k-1$ channels has been removed. Among all channels that have been
sensed at least $(m(t) - 1)b$ times, let $l_t$ denote the leader with the largest detection mean. Let
$r_t = m(t) \oslash (N - k + 1)$ be the round-robin candidate where, for simplicity, we have assumed
that the remaining channels are indexed by $1, \cdots, N - k + 1$. The user chooses the leader $l_t$ as
the $k$th best if $\bar{\theta}_{l_t}(t) > \bar{\theta}_{r_t}(t)$ and $I(\bar{\theta}_{r_t}(t), \bar{\theta}_{l_t}(t)) > \log(t-1)/\tau_{r_t,t}$; otherwise the user
chooses the round-robin candidate $r_t$ as the $k$th best.


Fig. 2. The decentralized policy $\pi_F^*$.
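
The choice between the leader and the round-robin candidate in step 2 of Fig. 2 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the figure: $I(x, y)$ is taken to be the Kullback-Leibler divergence between Bernoulli distributions with means $x$ and $y$, the modulo-like operator is taken to return values in $1, \cdots, N$ (mapped to 0-based indices below), and the names `bernoulli_kl` and `pick_best` are hypothetical. Only the selection of the single best channel is shown.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q); assumed form of I(p, q)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def pick_best(means, counts, t, num_updates, b):
    """Return the 0-based index of the channel chosen as the best at time t.

    means[c]    : detection (sample) mean of channel c
    counts[c]   : number of times channel c has been sensed
    num_updates : number of rank updates so far (at least 1)
    b           : constant with 0 < b < 1/N
    """
    n = len(means)
    # Round-robin candidate; assumption: the 1-based modulo maps to (U - 1) % N.
    r = (num_updates - 1) % n
    # Leader: largest detection mean among channels sensed often enough.
    eligible = [c for c in range(n) if counts[c] >= (num_updates - 1) * b]
    if not eligible or counts[r] == 0 or t <= 1:
        return r
    leader = max(eligible, key=lambda c: means[c])
    threshold = math.log(t - 1) / counts[r]
    if means[leader] > means[r] and bernoulli_kl(means[r], means[leader]) > threshold:
        return leader
    return r
```

For example, `pick_best([0.2, 0.8, 0.5], [100, 100, 100], t=50, num_updates=1, b=0.1)` selects the leader (index 1), whereas with only one observation of the round-robin candidate the threshold is too large and the candidate itself is kept, so it continues to be explored.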
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each  user  plays  the  M  arms  that  have  the  largest  sample  means  in  a  round  robin  fashion  with  a  different 
offset. Since each user is obligated to play each of the $N$ arms in the deterministic sequence, the system
regret rate is at least of the order of the number of slots in this sequence as time grows. In other words,
the density of the sequence should be small enough. However, the density of the sequence should also
be large enough to ensure efficient learning of the arm rank. This tradeoff between exploitation and
exploration needs to be properly addressed in the policy design. We show that by choosing a sequence
whose cardinality grows at the order $O(\sqrt{T})$ with time $T$, the system regret rate achieves the
same growth rate as the cardinality of the sequence.

A detailed implementation of $\pi_F^g$ is given in Fig. 3.

Theorem 4: For the general decentralized MAB with imperfect observation models, the system regret
rate under $\pi_F^g$ is of the order $O(\sqrt{T})$. Furthermore, $\pi_F^g$ achieves fairness among all users, i.e., the
local regret rate of each user is the same.

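Proof: Let $D(t)$ denote the number of slots in the deterministic sequence up to (but excluding) time
$t$. Let $\bar{\theta}_n(t)$ denote the sample mean of channel $n$ based on the observations in the deterministic sequence
up to (but excluding) time $t$. From [4], for i.i.d. random variables $\{Y_1, Y_2, \cdots\}$ with a finite variance,
$\Pr(|E(Y_1) - (\sum_{j=1}^{k} Y_j)/k| > \epsilon) = o(k^{-1})$, $\forall\ \epsilon > 0$. Choose
$0 < \epsilon < \min\{|\theta_i - \theta_j| : 1 \le i < j \le N,\ \theta_i \ne \theta_j\}/2$. We thus have,

$$
\begin{aligned}
R_T^{\pi_F^g}(\Theta) &\le \sum_{t=1}^{T} O\Big(\sum_{n=1}^{N} \Pr(|\theta_n - \bar{\theta}_n(t)| > \epsilon)\Big) + O(D(T)) && (5) \\
&= \sum_{t=1}^{T} o(1/D(t)) + O(D(T)) && (6) \\
&= \sum_{t=1}^{T} o(1/t^{1/2}) + O(T^{1/2}). && (7)
\end{aligned}
$$

Note that

$$
\sum_{t=1}^{T} o(1/t^{1/2}) = o\Big(\int_{t=1}^{T} t^{-1/2}\, dt\Big) = o(T^{1/2}). \qquad (8)
$$

From (5) and (8), $R_T^{\pi_F^g}(\Theta) = O(D(T)) = O(T^{1/2})$. ■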
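
As a reading aid for the integral comparison behind (8), the following Python sketch checks numerically that $\sum_{t=1}^{T} t^{-1/2}$ stays below $1 + \int_{1}^{T} t^{-1/2}\,dt = 2\sqrt{T} - 1$ and therefore grows no faster than $\sqrt{T}$. It is only an illustration of that elementary bound, not part of the proof; the function names are hypothetical.

```python
import math


def inverse_sqrt_sum(T):
    """Return sum_{t=1}^{T} t^(-1/2)."""
    return sum(t ** -0.5 for t in range(1, T + 1))


def integral_bound(T):
    """Return 1 + integral from 1 to T of t^(-1/2) dt, i.e. 2*sqrt(T) - 1."""
    return 2.0 * math.sqrt(T) - 1.0


for T in (10, 1000, 100000):
    s = inverse_sqrt_sum(T)
    # Print the sum, the upper bound, and the ratio of the sum to sqrt(T).
    print(T, round(s, 3), round(integral_bound(T), 3), round(s / math.sqrt(T), 3))
```

The ratio in the last column stays below 2, which is the constant implied by the integral.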

VII.  Simulation  Examples 

In this section, we study the performance of the order-optimal policy $\pi_F^*$ for the cognitive radio problem
and the policy $\pi_F^g$ for the general decentralized MAB with imperfect observation models.

A. The Performance of $\pi_F^*$ for the Cognitive Radio Network

We consider the scenario in which both the channel noise and the signal of the primary network are white
Gaussian processes with zero mean but different power densities. We adopt the energy detector, which
is optimal under the Neyman-Pearson criterion [8]. In Fig. 4, we observe that the regret converges
quickly as time goes on. In Fig. 5, we plot the constant of the logarithmic order as a function of $N$. In
this example, we observe that the system performs better for smaller detection errors. Furthermore, the
system performance is not monotonic as the number of channels increases. This is due to the tradeoff
that as $N$ increases, users are less likely to collide but learning the $M$ best channels becomes more
difficult. In Fig. 6, we plot the constant of the logarithmic order as a function of $M$. We observe that
the system performance degrades as $M$ increases. This is due to the increased competition and learning
load encountered by all users.

The Decentralized Policy $\pi_F^g$

Without loss of generality, consider user $i$.

• Notations and Inputs: Let $A(t)$ denote the index set of slots that belong to the deterministic
sequence up to (and including) time $t$. Let $D(t) = |A(t)|$. Initially, $A(1) = \{1\}$ and
$D(1) = 1$. Select an $a > 0$.

• At time $t$, user $i$ does the following.

1. If $a t^{1/2} > D(t)$ and $t \notin A(t)$, include $t$ in $A(t)$, set $D(t) = D(t-1) + 1$, and then go
to step 2. For the case that $a t^{1/2} \le D(t)$, go to step 2 if $t \in A(t)$ and step 3 otherwise.

2. User $i$ plays the $((D(t) + i - 1) \oslash N)$th arm and updates the sample mean of this arm.

3. User $i$ plays the arm with the $((t + i - 1) \oslash M)$th largest sample mean.

Fig. 3. The decentralized policy $\pi_F^g$.
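
The steps of the policy in Fig. 3 can be sketched in Python as follows. This is a minimal simulation under assumptions that go beyond the figure: rewards are Bernoulli and observed without error, users that choose the same arm in a slot earn nothing, the modulo-like operator returns values in $1, \cdots, N$, and each user ranks arms by the sample means it collected in the deterministic sequence only. The names `one_based_mod` and `simulate_policy` are hypothetical.

```python
import math
import random


def one_based_mod(x, n):
    """Assumed meaning of the modulo-like operator: a value in 1..n."""
    return (x - 1) % n + 1


def simulate_policy(theta, M, T, a, seed=0):
    """Run M users for T slots; return (total system reward, D(T))."""
    rng = random.Random(seed)
    N = len(theta)
    sums = [[0.0] * N for _ in range(M)]  # per-user reward sums (exploration slots)
    counts = [[0] * N for _ in range(M)]  # per-user sample counts (exploration slots)
    D = 0                                 # slots placed in the deterministic sequence
    total = 0.0
    for t in range(1, T + 1):
        explore = a * math.sqrt(t) > D    # step 1: extend the deterministic sequence
        if explore:
            D += 1
        rewards = [1.0 if rng.random() < p else 0.0 for p in theta]
        choices = []
        for i in range(1, M + 1):
            if explore:
                # Step 2: round-robin over all N arms with a user-specific offset.
                arm = one_based_mod(D + i - 1, N) - 1
            else:
                # Step 3: play the arm of a given rank in the user's own ordering.
                means = [sums[i - 1][n] / counts[i - 1][n] if counts[i - 1][n] else 0.0
                         for n in range(N)]
                ranked = sorted(range(N), key=lambda n: means[n], reverse=True)
                arm = ranked[one_based_mod(t + i - 1, M) - 1]
            choices.append(arm)
        for i, arm in enumerate(choices):
            if explore:
                sums[i][arm] += rewards[arm]
                counts[i][arm] += 1
            if choices.count(arm) == 1:   # assumption: colliding users earn nothing
                total += rewards[arm]
    return total, D
```

A call such as `simulate_policy([0.1 * k for k in range(1, 10)], M=2, T=5000, a=1.0)` returns the accumulated reward and the size of the deterministic sequence, which tracks $a\sqrt{T}$. Repeating the call for several values of `a` gives a rough feel for the tradeoff governed by this parameter; the numbers it produces are those of this sketch, not of the paper's experiments.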

B. The Performance of $\pi_F^g$ for the General Model

We compare the performance of $\pi_F^g$ under different values of the parameter $a$ (see Fig. 3), which
equals the constant of the $O(\sqrt{T})$ order of the cardinality of the deterministic sequence. Each arm
has a Bernoulli reward distribution. Intuitively, we want to choose a small $a$ since the regret rate equals
the growth rate of the cardinality of the sequence. However, from Fig. 7, we observe that the regret under the
smaller parameter converges at a much slower rate than that under the larger parameter. This is because,
for any arm, the convergence of the sample mean to the true mean is not fast enough in
terms of the number of samples. It is thus better to choose a fairly large parameter when considering the
short-horizon performance.

Fig. 4. The convergence of the regret ($M = 2$, $N = 9$, $\Theta = [0.1, 0.2, \cdots, 0.9]$, $\epsilon = 0.0854$, $\delta = 0.1$, (primary) signal
to noise ratio = 5 dB).


VIII.  Conclusion 

In this paper, we formulated the distributed learning problem in cognitive radio networks under
imperfect sensing. The optimal system regret rate is shown to be of logarithmic order. An order-optimal
decentralized policy is proposed that achieves the logarithmic order of the regret rate and thus leads
to fast convergence to the maximum throughput of the ideal scenario of known channel model and
centralized users. Furthermore, the cognitive radio example is extended to the general decentralized MAB
with imperfect observation models. A simple decentralized policy is proposed under this general model
to achieve the $O(\sqrt{T})$ order of the system regret rate as $T \to \infty$.

Appendix  A.  Proof  of  Lemma  2 

We prove by induction on selecting the $M$ best channels. Specifically, it is sufficient to show that,
given that the $i-1$ best channels are correctly selected, the expected number of updates at which the $i$th
best channel is not selected correctly is at most logarithmic with time for all $1 \le i \le M$.

Let $K$ denote the total number of updates over the horizon of $T$ slots. Let $V(K)$ denote the set
of updates at which the $i-1$ best channels are correctly selected up to the $K$th update. For any
$\alpha \in (0, \mu(\theta_{\sigma(i)}) - \mu(\theta_{\sigma(i+1)}))$, let $N_1(K)$ denote the number of updates in $V(K)$ at which channel $n$ is
selected as the $i$th best when $l_t = \sigma(i)$ and $|\bar{\theta}_{l_t}(t) - (1-\epsilon)\theta_{l_t}(t)| \le \alpha$ ($t$ is the update time), $N_2(K)$
the number of updates in $V(K)$ at which channel $n$ is selected as the $i$th best when $l_t = \sigma(i)$ and
$|\bar{\theta}_{l_t}(t) - (1-\epsilon)\theta_{l_t}(t)| > \alpha$, and $N_3(K)$ the number of updates in $V(K)$ when $l_t \ne \sigma(i)$. It is sufficient
to show that $E[N_1(K)]$, $E[N_2(K)]$, and $E[N_3(K)]$ are all at most of the order of $\log T$.

Fig. 5. The performance of $\pi_F^*$ ($T = 5000$, $M = 2$, $\Theta = [0.1, 0.2, \cdots, \frac{N}{10}]$, SNR: (primary) signal to noise ratio).

Consider first $E[N_1(K)]$. We have

$$
\begin{aligned}
E[N_1(K)] &= O\big(E\big[|\{1 \le k \le K : k \in V(K),\ \theta_{l_t} = \theta_{\sigma(i)},\ |\bar{\theta}_{l_t}(t) - (1-\epsilon)\theta_{l_t}(t)| \le \alpha, \text{ and the } k\text{th update is realized}\}|\big]\big) \\
&\le O\big(E\big[|\{1 \le j \le T-1 : \bar{\theta}_n(j \text{ samples}) > \theta_{\sigma(i)} - \alpha \text{ or } I(\bar{\theta}_n(j \text{ samples}), \theta_{\sigma(i)} - \alpha) \le \log(T-1)/j\}|\big]\big) \\
&\le O(\log T), \qquad (9)
\end{aligned}
$$

where the first equality is due to the fact that the probability that each update is realized for
channel sensing is lower bounded by a nonzero constant, the first inequality is due to
the structure of the local policy of $\pi_F^*$, and the second inequality follows from the property of Bernoulli
distributions established in [4].

Consider $E[N_2(K)]$. Since the number of observations obtained from $l_t$ at the $s$th ($\forall\ 1 \le s \le T$)
update is at least $(s-1)b$, we have that, $\forall\ 1 \le s \le T$,

$$
\begin{aligned}
&\Pr\{\text{at the } s\text{th update},\ \theta_{l_t} = \theta_{\sigma(i)},\ |\bar{\theta}_{l_t}(t) - (1-\epsilon)\theta_{l_t}(t)| > \alpha\} \\
&\quad \le \Pr\Big\{\sup_{j \ge b(s-1)} |\bar{\theta}_{l_t}(j \text{ samples}) - (1-\epsilon)\theta_{l_t}(t)| > \alpha\Big\} \\
&\quad = o(s^{-1}), \qquad (10)
\end{aligned}
$$

where the equality is due to the property of Bernoulli distributions established in [4].

Fig. 6. The performance of $\pi_F^*$ ($T = 5000$, $N = 9$, $\Theta = [0.1, 0.2, \cdots, 0.9]$, SNR: (primary) signal to noise ratio).

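We thus have,

$$
\begin{aligned}
E[N_2(K)] &= E\big(|\{1 \le k \le K : k \in V(K),\ \theta_{l_t} = \theta_{\sigma(i)},\ |\bar{\theta}_{l_t}(t) - (1-\epsilon)\theta_{l_t}(t)| > \alpha\}|\big) \\
&\le \sum_{s=1}^{T} \Pr\{\text{at the } s\text{th update},\ \theta_{l_t} = \theta_{\sigma(i)},\ |\bar{\theta}_{l_t}(t) - (1-\epsilon)\theta_{l_t}(t)| > \alpha\} \\
&= o(\log T). \qquad (11)
\end{aligned}
$$

Next, we show that $E[N_3(K)] = o(\log T)$.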

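Choose $0 < \alpha_1 < (\mu(\theta_{\sigma(i)}) - \mu(\theta_{\sigma(i+1)}))/2$ and $c > (1 - Nb)^{-1}$. For $r = 0, 1, \cdots$, define the following
events.

$$
\begin{aligned}
A_r &= \bigcap_{i \le n \le N} \Big\{ \max_{b c^{r-1} \le j \le c^{r+1}} |\bar{\theta}_{\sigma(n)}(j \text{ samples}) - \theta_{\sigma(n)}| \le \alpha_1 \Big\}, \\
B_r &= \Big\{ \bar{\theta}_{\sigma(i)}(j \text{ samples}) \ge \theta_{\sigma(i)} - \alpha_1 \text{ or } I(\bar{\theta}_{\sigma(i)}(j \text{ samples}), \theta_{\sigma(i)} - \alpha_1) \le \log(s_m - 1)/j \\
&\qquad \text{for all } 1 \le j \le bm,\ c^{r-1} \le m \le c^{r+1}, \text{ and } s_m > m \Big\}. \qquad (12)
\end{aligned}
$$

By (10), we have $\Pr(\bar{A}_r) = o(c^{-r})$. Consider the following event:

$$
C_r = \Big\{ \bar{\theta}_{\sigma(i)}(j \text{ samples}) \ge \theta_{\sigma(i)} - \alpha_1 \text{ or } I(\bar{\theta}_{\sigma(i)}(j \text{ samples}), \theta_{\sigma(i)} - \alpha_1) \le \log(m)/j \text{ for all } 1 \le j \le bm,\ c^{r-1} \le m \le c^{r+1} \Big\}. \qquad (13)
$$

We have that $B_r \supset C_r$. From Lemma 1-(i) in [4], $\Pr(\bar{C}_r) = o(c^{-r})$. We thus have $\Pr(\bar{B}_r) = o(c^{-r})$.

Fig. 7. The convergence of the regret ($M = 2$, $N = 9$, $\Theta = [0.1, 0.2, \cdots, 0.9]$).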

Consider the $s$th update where $c^{r-1} \le s - 1 \le c^{r+1}$. When the round-robin candidate $r_t = \sigma(i)$, we
show that on the event $A_r \cap B_r$, $\sigma(i)$ must be selected as the $i$th best. It is sufficient to focus on the
nontrivial case that $\theta_{l_t} < \theta_{\sigma(i)}$. Since $\tau_{l_t,t} \ge (s-1)b$, on $A_r$, we have $\bar{\theta}_{l_t}(t) < \theta_{\sigma(i)} - \alpha_1$. We also have,
on $A_r \cap B_r$,

$$
\bar{\theta}_{\sigma(i)}(t) \ge \theta_{\sigma(i)} - \alpha_1 \text{ or } I(\bar{\theta}_{\sigma(i)}(t), \theta_{\sigma(i)} - \alpha_1) \le \log(t-1)/\tau_{\sigma(i),t}. \qquad (14)
$$

Channel $\sigma(i)$ is thus selected as the $i$th best on $A_r \cap B_r$. Since $(1 - c^{-1})/N > b$, for any $c^r \le s - 1 \le c^{r+1}$,
there exists an $r_0$ such that on $A_r \cap B_r$, $\tau_{\sigma(i),t} \ge (1/N)(s - c^{r-1} - 2N) > bs$ for all $r > r_0$. It thus
follows that on $A_r \cap B_r$, for any $c^r \le s - 1 \le c^{r+1}$, we have $\tau_{\sigma(i),t} \ge (s-1)b$, and $\sigma(i)$ is thus the
leader. We have, for all $r > r_0$,

$$
\Pr(\text{at the } s\text{th update},\ c^{r-1} \le s - 1 \le c^{r+1},\ l_t \ne \sigma(i)) \le \Pr(\bar{A}_r) + \Pr(\bar{B}_r) = o(c^{-r}). \qquad (15)
$$

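Therefore,

$$
\begin{aligned}
E[N_3(K)] &= E\big[|\{1 \le k \le K : k \in V(K),\ l_t \ne \sigma(i)\}|\big] \\
&\le \sum_{s=1}^{T} \Pr(\text{at the } s\text{th update},\ l_t \ne \sigma(i)) \\
&\le 1 + \sum_{0 \le r \le \log_c T}\ \sum_{c^r \le s-1 \le c^{r+1}} \Pr(\text{at the } s\text{th update},\ l_t \ne \sigma(i)) \\
&= 1 + \sum_{0 \le r \le \log_c T} o(1) \\
&= o(\log T). \qquad (16)
\end{aligned}
$$

From (9), (11), and (16), we arrive at Lemma 2.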
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